Abstract: Using a support vector machine requires setting two types of hyperparameters:
the soft margin parameter C and the parameters of the kernel. To perform this model
selection task, the method of choice is cross-validation. Its leave-one-out variant is known
to produce an estimator of the generalization error which is almost unbiased. Its major
drawback is its computational cost. To overcome this difficulty, several upper bounds
on the leave-one-out error of the pattern recognition SVM have been derived. Among those
bounds, the most popular one is probably the radius-margin bound. It applies to the hard
margin pattern recognition SVM, and by extension to the 2-norm SVM. In this report,
we introduce a quadratic loss M-SVM, the M-SVM$^2$, as a direct extension of the 2-norm
SVM to the multi-class case. For this machine, a generalized radius-margin bound is then
established.

Keywords: M-SVMs, model selection, leave-one-out error, radius-margin bound.
A Quadratic Loss Multi-Class SVM

Summary: Implementing a support vector machine requires determining the values of
two types of hyperparameters: the "soft margin" parameter C and the parameters of the
kernel. To perform this model selection task, the method of choice is cross-validation.
Its "leave-one-out" variant is known to provide an almost unbiased estimator of the
generalization error. Its main drawback lies in the computation time it requires. To
overcome this difficulty, several upper bounds on the leave-one-out error of the SVM
computing dichotomies have been proposed. The most popular of these upper bounds is
probably the "radius-margin" bound. It applies to the hard margin version of the machine,
and by extension to the so-called "2-norm" variant. This report introduces a "quadratic
loss" M-SVM, the M-SVM$^2$, as a direct extension of the 2-norm SVM to the multi-class
case. For this machine, a generalized radius-margin bound is then established.

Keywords: M-SVM, model selection, leave-one-out error, radius-margin bound.
1 Introduction

Using a support vector machine (SVM) [2, 4] requires setting two types of hyperparameters:
the soft margin parameter C and the parameters of the kernel. To perform this model
selection task, several approaches are available (see for instance [9, 12]). The solution of
choice consists in applying a cross-validation procedure. Among those procedures, the
leave-one-out one appears especially attractive, since it is known to produce an estimator of the
generalization error which is almost unbiased [11]. Its downside is that it is
highly time consuming. This is the reason why, in recent years, a number of upper bounds
on the leave-one-out error of pattern recognition SVMs have been proposed in the literature (see
[3] for a survey). Among those bounds, the tightest one is the span bound [16]. However,
the results of Chapelle and co-workers presented in [3] show that another bound, the
radius-margin one [15], achieves equivalent performance for model selection while being far simpler
to compute. This is the reason why it is currently the most popular bound. It applies to the
hard margin machine and, by extension, to the 2-norm SVM (see for instance Chapter 7 in
[13]).

In this report, a multi-class extension of the 2-norm SVM is introduced. This machine,
named M-SVM$^2$, is a quadratic loss multi-class SVM, i.e., a multi-class SVM (M-SVM) in
which the $\ell_1$-norm on the vector of slack variables has been replaced with a quadratic form.
The standard M-SVM on which it is based is the one of Lee, Lin and Wahba [10]. As with the
2-norm SVM, its training algorithm is equivalent to the training algorithm of a hard margin
machine obtained by a simple change of kernel. We then establish a generalized
radius-margin bound on the leave-one-out error of the hard margin version of the M-SVM of Lee,
Lin and Wahba.

The organization of this paper is as follows. Section 2 presents the multi-class SVMs, by
describing their common architecture and the general form taken by their different training
algorithms. It focuses on the M-SVM of Lee, Lin and Wahba. In Section 3, the M-SVM$^2$
is introduced as a particular case of quadratic loss M-SVM. Its connection with the hard
margin version of the M-SVM of Lee, Lin and Wahba is highlighted, as well as the fact that
it constitutes a multi-class generalization of the 2-norm SVM. Section 4 is devoted to the
formulation and proof of the corresponding multi-class radius-margin bound. Finally, we
draw conclusions and outline our ongoing research in Section 5.
2 Multi-Class SVMs

2.1 Formalization of the learning problem

We are interested here in multi-class pattern recognition problems. Formally, we consider
the case of $Q$-category classification problems with $3 \leq Q < \infty$, but our results extend to
the case of dichotomies. Each object is represented by its description $x \in \mathcal{X}$ and the set
$\mathcal{Y}$ of the categories $y$ can be identified with the set of indexes of the categories: $[\![1, Q]\!]$.
We assume that the link between objects and categories can be described by an unknown
probability measure $P$ on the product space $\mathcal{X} \times \mathcal{Y}$. The aim of the learning problem consists
in selecting in a set $\mathcal{G}$ of functions $g = (g_k)_{1 \leq k \leq Q}$ from $\mathcal{X}$ into $\mathbb{R}^Q$ a function classifying
data in an optimal way. The criterion of optimality must be specified. The function $g$
assigns $x \in \mathcal{X}$ to the category $l$ if and only if $g_l(x) > \max_{k \neq l} g_k(x)$. In case of a tie (ex aequo),
$x$ is assigned to a dummy category denoted by $*$. Let $f$ be the decision function (from $\mathcal{X}$
into $\mathcal{Y} \cup \{*\}$) associated with $g$. With these definitions at hand, the objective function to
be minimized is the probability of error $P(f(X) \neq Y)$. The optimization process, called
training, is based on empirical data. More precisely, we assume that there exists a random
pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, distributed according to $P$, and we are provided with an $m$-sample
$D_m = ((X_i, Y_i))_{1 \leq i \leq m}$ of independent copies of $(X, Y)$.
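
The decision rule described above, including the dummy category used for ties, can be sketched as follows. This is only an illustration: the names `decide`, `empirical_error` and `DUMMY` are hypothetical, and the empirical error rate is shown as the natural sample counterpart of $P(f(X) \neq Y)$, not as a quantity defined in this report.

```python
from typing import List, Sequence, Union

DUMMY = "*"  # dummy category assigned in case of a tie (hypothetical name)


def decide(scores: Sequence[float]) -> Union[int, str]:
    """Return the category l in 1..Q whose score strictly dominates, or DUMMY on a tie."""
    best = max(scores)
    winners = [k for k, s in enumerate(scores, start=1) if s == best]
    return winners[0] if len(winners) == 1 else DUMMY


def empirical_error(score_rows: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of examples with f(x_i) != y_i; a tie always counts as an error."""
    wrong = sum(1 for s, y in zip(score_rows, labels) if decide(s) != y)
    return wrong / len(labels)
```

For instance, `decide([0.2, 1.5, -1.7])` selects category 2, while `decide([1.0, 1.0, -2.0])` falls back to the dummy category.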

There are two questions raised by such problems: how to properly choose the class of
functions $\mathcal{G}$ and how to determine the best candidate $g^*$ in this class, using only $D_m$. This
report addresses the first question, named model selection, in the particular case when the
model considered is an M-SVM. The second question, named function selection, is addressed
for instance in [8].

2.2 Architecture and training algorithms 

M-SVMs, like all the SVMs, belong to the family of kernel machines. As such, they operate 
on a class of functions induced by a positive semidefinite (Mercer) kernel. This calls for the 
formulation of some definitions and propositions. 

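Definition 1 (Positive semidefinite kernel) A positive semidefinite kernel $\kappa$ on the set
$\mathcal{X}$ is a continuous and symmetric function $\kappa : \mathcal{X}^2 \to \mathbb{R}$ verifying:

$$\forall n \in \mathbb{N}^*, \; \forall (x_i)_{1 \leq i \leq n} \in \mathcal{X}^n, \; \forall (a_i)_{1 \leq i \leq n} \in \mathbb{R}^n, \quad \sum_{i=1}^n \sum_{j=1}^n a_i a_j \kappa(x_i, x_j) \geq 0.$$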
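
The inequality of Definition 1 can be probed numerically on a finite sample. The sketch below assumes a Gaussian kernel (a choice made for the example, not one prescribed by the report) and evaluates the quadratic form for random coefficient vectors; it is a sanity check, not a proof of positive semidefiniteness.

```python
import math
import random


def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian (RBF) kernel, used here as an example of positive semidefinite kernel."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / (2.0 * sigma ** 2))


def quadratic_form(kernel, points, coeffs):
    """Compute sum_i sum_j a_i a_j kappa(x_i, x_j)."""
    return sum(ai * aj * kernel(xi, xj)
               for ai, xi in zip(coeffs, points)
               for aj, xj in zip(coeffs, points))


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
values = [quadratic_form(gaussian_kernel, points,
                         [rng.uniform(-1, 1) for _ in points])
          for _ in range(200)]
min_value = min(values)  # expected to be nonnegative up to rounding errors
```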

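Definition 2 (Reproducing kernel Hilbert space [1]) Let $(\mathbf{H}, \langle \cdot, \cdot \rangle_{\mathbf{H}})$ be a Hilbert space
of functions on $\mathcal{X}$ ($\mathbf{H} \subset \mathbb{R}^{\mathcal{X}}$). A function $\kappa : \mathcal{X}^2 \to \mathbb{R}$ is a reproducing kernel of $\mathbf{H}$ if and
only if:

1. $\forall x \in \mathcal{X}$, $\kappa_x = \kappa(x, \cdot) \in \mathbf{H}$;

2. $\forall x \in \mathcal{X}$, $\forall h \in \mathbf{H}$, $\langle h, \kappa_x \rangle_{\mathbf{H}} = h(x)$ (reproducing property).

A Hilbert space of functions which possesses a reproducing kernel is called a reproducing
kernel Hilbert space (RKHS).

Proposition 1 Let $(\mathbf{H}_\kappa, \langle \cdot, \cdot \rangle_{\mathbf{H}_\kappa})$ be a RKHS of functions on $\mathcal{X}$ with reproducing kernel $\kappa$.
Then, there exists a map $\Phi$ from $\mathcal{X}$ into a Hilbert space $(E_{\Phi(\mathcal{X})}, \langle \cdot, \cdot \rangle)$ such that:

$$\forall (x, x') \in \mathcal{X}^2, \quad \kappa(x, x') = \langle \Phi(x), \Phi(x') \rangle. \qquad (1)$$

$\Phi$ is called a feature map and $E_{\Phi(\mathcal{X})}$ a feature space.

The connection between positive semidefinite kernels and RKHS is the following.

Proposition 2 If $\kappa$ is a positive semidefinite kernel on $\mathcal{X}$, then there exists a RKHS
$(\mathbf{H}, \langle \cdot, \cdot \rangle_{\mathbf{H}})$ of functions on $\mathcal{X}$ such that $\kappa$ is a reproducing kernel of $\mathbf{H}$.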

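Let $\kappa$ be a positive semidefinite kernel on $\mathcal{X}$ and let $(\mathbf{H}_\kappa, \langle \cdot, \cdot \rangle_{\mathbf{H}_\kappa})$ be the RKHS spanned
by $\kappa$. Let $\bar{\mathcal{H}} = (\mathbf{H}_\kappa, \langle \cdot, \cdot \rangle_{\mathbf{H}_\kappa})^Q$ and let $\mathcal{H} = ((\mathbf{H}_\kappa, \langle \cdot, \cdot \rangle_{\mathbf{H}_\kappa}) + \{1\})^Q$. By construction, $\mathcal{H}$ is
the class of vector-valued functions $h = (h_k)_{1 \leq k \leq Q}$ on $\mathcal{X}$ such that

$$h(\cdot) = \left( \sum_{i=1}^{m_k} \beta_{ik} \kappa(x_{ik}, \cdot) + b_k \right)_{1 \leq k \leq Q}$$

where the $x_{ik}$ are elements of $\mathcal{X}$, as well as the limits of these functions when the sets
$\{x_{ik} : 1 \leq i \leq m_k\}$ become dense in $\mathcal{X}$ in the norm induced by the dot product (see for
instance [17]). Due to Equation (1), $\mathcal{H}$ can be seen as a multivariate affine model on $\Phi(\mathcal{X})$.
Functions $h$ can then be rewritten as:

$$h(\cdot) = \left( \langle w_k, \cdot \rangle + b_k \right)_{1 \leq k \leq Q}$$

where the vectors $w_k$ are elements of $E_{\Phi(\mathcal{X})}$. They are thus described by the pair $(w, b)$
with $w = (w_k)_{1 \leq k \leq Q} \in E_{\Phi(\mathcal{X})}^Q$ and $b = (b_k)_{1 \leq k \leq Q} \in \mathbb{R}^Q$. As a consequence, $\bar{\mathcal{H}}$ can be seen
as a multivariate linear model on $\Phi(\mathcal{X})$, endowed with a norm $\|\cdot\|_{\bar{\mathcal{H}}}$ given by:

$$\forall \bar{h} \in \bar{\mathcal{H}}, \quad \|\bar{h}\|_{\bar{\mathcal{H}}} = \sqrt{\sum_{k=1}^Q \|w_k\|^2}$$

where $\|w_k\| = \sqrt{\langle w_k, w_k \rangle}$. With these definitions and propositions at hand, a generic
definition of the M-SVMs can be formulated as follows.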

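Definition 3 (M-SVM, Definition 42 in [8]) Let $((x_i, y_i))_{1 \leq i \leq m} \in (\mathcal{X} \times [\![1, Q]\!])^m$ and
$\lambda \in \mathbb{R}_+^*$. A $Q$-category M-SVM is a large margin discriminant model obtained by minimizing
over the hyperplane $\sum_{k=1}^Q h_k = 0$ of $\mathcal{H}$ a penalized risk $J_{\text{M-SVM}}$ of the form:

$$J_{\text{M-SVM}}(h) = \sum_{i=1}^m \ell_{\text{M-SVM}}(y_i, h(x_i)) + \lambda \|\bar{h}\|_{\bar{\mathcal{H}}}^2$$

where $\bar{h} = (\langle w_k, \cdot \rangle)_{1 \leq k \leq Q}$ is the linear part of $h$ and the data fit component involves
a loss function $\ell_{\text{M-SVM}}$ which is convex.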
Three main models of M-SVMs can be found in the literature. The oldest one is the model
of Weston and Watkins [19], which corresponds to the loss function $\ell_{\text{WW}}$ given by:

$$\ell_{\text{WW}}(y, h(x)) = \sum_{k \neq y} \left( 1 - h_y(x) + h_k(x) \right)_+,$$

where the hinge loss function $(\cdot)_+$ is the function $\max(0, \cdot)$. The second one is due to
Crammer and Singer [5] and corresponds to the loss function $\ell_{\text{CS}}$ given by:

$$\ell_{\text{CS}}(y, h(x)) = \left( 1 - h_y(x) + \max_{k \neq y} h_k(x) \right)_+.$$

The most recent model is the one of Lee, Lin and Wahba [10], which corresponds to the loss
function $\ell_{\text{LLW}}$ given by:

$$\ell_{\text{LLW}}(y, h(x)) = \sum_{k \neq y} \left( h_k(x) + \frac{1}{Q-1} \right)_+. \qquad (2)$$

Among the three models, the M-SVM of Lee, Lin and Wahba is the only one that implements
asymptotically the Bayes decision rule. It is Fisher consistent [20, 14].
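
To make the difference between the three loss functions concrete, here is a minimal sketch that evaluates them on a single score vector. The function names are hypothetical, categories are numbered from 1 to Q, and the outer hinge on the Crammer and Singer loss is the usual convention, taken here as an assumption.

```python
def hinge(t: float) -> float:
    """Hinge function (t)_+ = max(0, t)."""
    return max(0.0, t)


def loss_ww(y: int, h: list) -> float:
    """Weston and Watkins: sum over k != y of (1 - h_y + h_k)_+."""
    return sum(hinge(1.0 - h[y - 1] + h[k]) for k in range(len(h)) if k != y - 1)


def loss_cs(y: int, h: list) -> float:
    """Crammer and Singer: (1 - h_y + max_{k != y} h_k)_+ (outer hinge assumed)."""
    return hinge(1.0 - h[y - 1] + max(h[k] for k in range(len(h)) if k != y - 1))


def loss_llw(y: int, h: list) -> float:
    """Lee, Lin and Wahba: sum over k != y of (h_k + 1/(Q-1))_+."""
    q = len(h)
    return sum(hinge(h[k] + 1.0 / (q - 1)) for k in range(q) if k != y - 1)


scores = [1.0, -0.5, -0.5]  # Q = 3, components sum to zero
```

With these scores, all three losses vanish when the true category is 1, whereas for a true category of 2 they take the values 3.5, 2.5 and 1.5 respectively.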

2.3 The M-SVM of Lee, Lin and Wahba 

The substitution in Definition 3 of $\ell_{\text{M-SVM}}$ with the expression of the loss function $\ell_{\text{LLW}}$ given
by Equation (2) provides us with the expressions of the quadratic programming (QP) problems
corresponding to the training algorithms of the hard margin and soft margin versions of the
M-SVM of Lee, Lin and Wahba.

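Problem 1 (Hard margin M-SVM)

$$\min_{w, b} J_{\text{HM}}(w, b)$$

$$\text{s.t.} \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \leq -\frac{1}{Q-1}, & (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q) \\ \sum_{k=1}^Q w_k = 0 \\ \sum_{k=1}^Q b_k = 0 \end{cases}$$

where

$$J_{\text{HM}}(w, b) = \frac{1}{2} \sum_{k=1}^Q \|w_k\|^2.$$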



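Problem 2 (Soft margin M-SVM)

$$\min_{w, b} J_{\text{SM}}(w, b)$$

$$\text{s.t.} \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \leq -\frac{1}{Q-1} + \xi_{ik}, & (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q) \\ \xi_{ik} \geq 0, & (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q) \\ \sum_{k=1}^Q w_k = 0 \\ \sum_{k=1}^Q b_k = 0 \end{cases}$$

where

$$J_{\text{SM}}(w, b) = \frac{1}{2} \sum_{k=1}^Q \|w_k\|^2 + C \sum_{i=1}^m \sum_{k \neq y_i} \xi_{ik}.$$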

In Problem 2, the $\xi_{ik}$ are slack variables introduced in order to relax the constraints of
correct classification. The coefficient $C$, which characterizes the trade-off between prediction
accuracy on the training set and smoothness of the solution, can be expressed in terms of the
regularization coefficient $\lambda$ as follows: $C = (2\lambda)^{-1}$. It is called the soft margin parameter.
Instead of directly solving Problems 1 and 2, one usually solves their Wolfe dual [6]. We
now derive the dual problem of Problem 1. Giving the details of the implementation of the
Lagrangian duality will provide us with partial results which will prove useful in the sequel.

Let $\alpha = (\alpha_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q} \in \mathbb{R}_+^{Qm}$ be the vector of Lagrange multipliers associated
with the constraints of good classification. It is for convenience of notation that this
vector is expressed with double subscript and that the dummy variables $\alpha_{i y_i}$, all equal to 0,
are introduced. Let $\delta \in E_{\Phi(\mathcal{X})}$ be the Lagrange multiplier associated with the constraint
$\sum_{k=1}^Q w_k = 0$ and $\beta \in \mathbb{R}$ the Lagrange multiplier associated with the constraint $\sum_{k=1}^Q b_k = 0$.
The Lagrangian function of Problem 1 is given by:

$$L(w, b, \alpha, \beta, \delta) = \frac{1}{2} \sum_{k=1}^Q \|w_k\|^2 - \left\langle \delta, \sum_{k=1}^Q w_k \right\rangle - \beta \sum_{k=1}^Q b_k + \sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik} \left( \langle w_k, \Phi(x_i) \rangle + b_k + \frac{1}{Q-1} \right). \qquad (3)$$

Setting the gradient of the Lagrangian function with respect to $w_k$ equal to the null vector
provides us with $Q$ alternative expressions for the optimal value of vector $\delta$:

$$\delta^* = w_k^* + \sum_{i=1}^m \alpha_{ik}^* \Phi(x_i), \quad (1 \leq k \leq Q). \qquad (4)$$

Since by hypothesis, $\sum_{k=1}^Q w_k^* = 0$, summing over the index $k$ provides us with the expression
of $\delta^*$ as a function of dual variables only:

$$\delta^* = \frac{1}{Q} \sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik}^* \Phi(x_i). \qquad (5)$$
By substitution into (4), we get the expression of the vectors $w_k$ at the optimum:

$$w_k^* = \frac{1}{Q} \sum_{i=1}^m \sum_{l=1}^Q \alpha_{il}^* \Phi(x_i) - \sum_{i=1}^m \alpha_{ik}^* \Phi(x_i), \quad (1 \leq k \leq Q)$$

which can also be written as

$$w_k^* = \sum_{i=1}^m \sum_{l=1}^Q \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i), \quad (1 \leq k \leq Q) \qquad (6)$$

where $\delta$ is the Kronecker symbol.
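
As an illustration of Equation (6), the sketch below computes the vectors $w_k$ from a given vector of dual variables in the simplest setting where the feature map is the identity (linear kernel), which is an assumption made for the example. The function name `primal_weights` is hypothetical. The construction guarantees that the $w_k$ sum to the null vector, whatever the values of the dual variables.

```python
def primal_weights(alpha, X, q):
    """w_k = sum_i sum_l alpha[i][l] * (1/Q - delta_{k,l}) * x_i, with Phi = identity.

    alpha[i][l] holds the dual variable of example i and category l + 1.
    """
    dim = len(X[0])
    W = []
    for k in range(q):
        w = [0.0] * dim
        for a_i, x in zip(alpha, X):
            # sum_l a_il * (1/Q - delta_kl) = (sum_l a_il) / Q - a_ik
            coef = sum(a_i) / q - a_i[k]
            for t in range(dim):
                w[t] += coef * x[t]
        W.append(w)
    return W


X = [(1.0, 0.0), (0.0, 2.0)]
alpha = [[0.0, 1.0, 2.0], [3.0, 0.0, 1.0]]  # arbitrary nonnegative values
W = primal_weights(alpha, X, q=3)
column_sums = [sum(w[t] for w in W) for t in range(2)]  # both close to 0
```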

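Let us now set the gradient of (3) with respect to $b$ equal to the null vector. It comes:

$$\beta^* = \sum_{i=1}^m \alpha_{ik}^*, \quad (1 \leq k \leq Q)$$

and thus

$$\sum_{i=1}^m \sum_{l=1}^Q \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \quad (1 \leq k \leq Q).$$

Given the constraint $\sum_{k=1}^Q b_k^* = 0$, this implies that:

$$\sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik}^* b_k^* = \beta^* \sum_{k=1}^Q b_k^* = 0. \qquad (7)$$

By application of (6),

$$\begin{aligned}
\sum_{k=1}^Q \|w_k^*\|^2 &= \sum_{k=1}^Q \left\langle \sum_{i=1}^m \sum_{l=1}^Q \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i), \; \sum_{j=1}^m \sum_{n=1}^Q \alpha_{jn}^* \left( \frac{1}{Q} - \delta_{k,n} \right) \Phi(x_j) \right\rangle \\
&= \sum_{i=1}^m \sum_{j=1}^m \sum_{l=1}^Q \sum_{n=1}^Q \alpha_{il}^* \alpha_{jn}^* \langle \Phi(x_i), \Phi(x_j) \rangle \sum_{k=1}^Q \left( \frac{1}{Q} - \delta_{k,l} \right) \left( \frac{1}{Q} - \delta_{k,n} \right) \\
&= \sum_{i=1}^m \sum_{j=1}^m \sum_{l=1}^Q \sum_{n=1}^Q \alpha_{il}^* \alpha_{jn}^* \left( \delta_{l,n} - \frac{1}{Q} \right) \kappa(x_i, x_j). \qquad (8)
\end{aligned}$$

Still by application of (6),

$$\begin{aligned}
\sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik}^* \langle w_k^*, \Phi(x_i) \rangle &= \sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik}^* \left\langle \sum_{j=1}^m \sum_{l=1}^Q \alpha_{jl}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_j), \; \Phi(x_i) \right\rangle \\
&= \sum_{i=1}^m \sum_{j=1}^m \sum_{k=1}^Q \sum_{l=1}^Q \alpha_{ik}^* \alpha_{jl}^* \left( \frac{1}{Q} - \delta_{k,l} \right) \kappa(x_i, x_j). \qquad (9)
\end{aligned}$$

Combining (8) and (9) gives:

$$\frac{1}{2} \sum_{k=1}^Q \|w_k^*\|^2 + \sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik}^* \langle w_k^*, \Phi(x_i) \rangle = -\frac{1}{2} \sum_{k=1}^Q \|w_k^*\|^2 = -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \sum_{k=1}^Q \sum_{l=1}^Q \alpha_{ik}^* \alpha_{jl}^* \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j). \qquad (10)$$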

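In what follows, we use the notation $e_n$ to designate the vector of $\mathbb{R}^n$ such that all its
components are equal to $e$. Let $H$ be the matrix of $\mathcal{M}_{Qm,Qm}(\mathbb{R})$ of general term:

$$h_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j).$$

With these notations at hand, reporting (7) and (10) in (3) provides us with the algebraic
expression of the Lagrangian function at the optimum:

$$L(\alpha^*) = -\frac{1}{2} \alpha^{*T} H \alpha^* + \frac{1}{Q-1} 1_{Qm}^T \alpha^*.$$

This eventually provides us with the Wolfe dual formulation of Problem 1.

Problem 3 (Hard margin M-SVM, dual formulation)

$$\max_{\alpha} J_{\text{LLW},d}(\alpha)$$

$$\text{s.t.} \begin{cases} \alpha_{ik} \geq 0, & (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q) \\ \sum_{i=1}^m \sum_{l=1}^Q \alpha_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, & (1 \leq k \leq Q) \end{cases}$$

where

$$J_{\text{LLW},d}(\alpha) = -\frac{1}{2} \alpha^T H \alpha + \frac{1}{Q-1} 1_{Qm}^T \alpha,$$

with the general term of the Hessian matrix $H$ being

$$h_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \kappa(x_i, x_j).$$

Let the couple $(w^0, b^0)$ denote the optimal solution of Problem 1 and equivalently, let
$\alpha^0 = (\alpha^0_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q} \in \mathbb{R}_+^{Qm}$ be the optimal solution of Problem 3. According to (6), the
expression of $w_k^0$ is then:

$$w_k^0 = \sum_{i=1}^m \sum_{l=1}^Q \alpha^0_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i).$$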
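
The matrix $H$ and the dual objective function can be assembled directly from a Gram matrix. The following sketch is only meant to make the indexing explicit: the pair $(i, k)$ is stored at position $iQ + k$ (zero-based), which is a convention chosen for the example, and the function names are hypothetical. It evaluates the objective only and does not solve the QP problem.

```python
def delta(a, b):
    """Kronecker symbol."""
    return 1.0 if a == b else 0.0


def build_hessian(gram, q):
    """H[(i,k),(j,l)] = (delta_{k,l} - 1/Q) * kappa(x_i, x_j), (i,k) stored at i*Q + k."""
    m = len(gram)
    return [[(delta(k, l) - 1.0 / q) * gram[i][j]
             for j in range(m) for l in range(q)]
            for i in range(m) for k in range(q)]


def dual_objective(alpha, hessian, q):
    """J(alpha) = -1/2 alpha^T H alpha + 1/(Q-1) * sum(alpha)."""
    n = len(alpha)
    quad = sum(alpha[a] * hessian[a][b] * alpha[b]
               for a in range(n) for b in range(n))
    return -0.5 * quad + sum(alpha) / (q - 1)


gram = [[1.0, 0.0], [0.0, 1.0]]          # toy Gram matrix, m = 2
H = build_hessian(gram, q=3)
alpha = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]   # arbitrary point, not an optimum
value = dual_objective(alpha, H, q=3)     # equals 2/3 for this toy input
```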
2.4 Geometrical margins

From a geometrical point of view, the algorithms described above tend to construct a set
of hyperplanes $\{(w_k, b_k) : 1 \leq k \leq Q\}$ that maximize globally the $C_Q^2$ margins between the
different categories. If these margins are defined as in the bi-class case, their analytical
expression is more complex.

Definition 4 (Geometrical margins, Definition 7 in [7]) Let us consider a $Q$-category
M-SVM (a function of $\mathcal{H}$) classifying the examples of its training set $\{(x_i, y_i) : 1 \leq i \leq m\}$
without error. $\gamma_{kl}$, its margin between categories $k$ and $l$, is defined as the smallest distance
of a point either in $k$ or $l$ to the hyperplane separating those categories. Let us denote

$$d_{\text{M-SVM}} = \min_{1 \leq k < l \leq Q} \left\{ \min \left[ \min_{i : y_i = k} \left( h_k(x_i) - h_l(x_i) \right), \; \min_{j : y_j = l} \left( h_l(x_j) - h_k(x_j) \right) \right] \right\}$$

and for $1 \leq k < l \leq Q$, let $d_{\text{M-SVM},kl}$ be:

$$d_{\text{M-SVM},kl} = \frac{1}{d_{\text{M-SVM}}} \min \left[ \min_{i : y_i = k} \left( h_k(x_i) - h_l(x_i) - d_{\text{M-SVM}} \right), \; \min_{j : y_j = l} \left( h_l(x_j) - h_k(x_j) - d_{\text{M-SVM}} \right) \right].$$

Then we have:

$$\gamma_{kl} = d_{\text{M-SVM}} \, \frac{1 + d_{\text{M-SVM},kl}}{\|w_k - w_l\|}.$$

Given the constraints of Problem 1, the expression of $d_{\text{M-SVM}}$ corresponding to the M-SVM
of Lee, Lin and Wahba is:

$$d_{\text{LLW}} = \frac{Q}{Q-1}.$$

Remark 1 The values of the parameters $d_{\text{M-SVM},kl}$ (or $d_{\text{LLW},kl}$ in the case of interest) are
known as soon as the pair $(w^0, b^0)$ is known.
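
The quantities of Definition 4 can be computed from a trained model as sketched below. The example assumes a linear model in the input space (feature map equal to the identity), a training set classified without error and at least one example per category; the function name `geometrical_margins` and the toy data are hypothetical.

```python
import math


def geometrical_margins(W, b, X, y):
    """Return (d, gammas) where gammas[(k, l)] is the margin between categories k < l.

    W: list of Q weight vectors, b: list of Q biases, X: points, y: labels in 1..Q.
    """
    q = len(W)

    def h(k, x):
        return sum(wi * xi for wi, xi in zip(W[k], x)) + b[k]

    def gap(k, l):
        # Smallest value of h_k - h_l over category k and of h_l - h_k over category l.
        vals = [h(k, x) - h(l, x) for x, c in zip(X, y) if c == k + 1]
        vals += [h(l, x) - h(k, x) for x, c in zip(X, y) if c == l + 1]
        return min(vals)

    pairs = [(k, l) for k in range(q) for l in range(k + 1, q)]
    d = min(gap(k, l) for k, l in pairs)
    gammas = {}
    for k, l in pairs:
        d_kl = (gap(k, l) - d) / d
        norm = math.sqrt(sum((a - c) ** 2 for a, c in zip(W[k], W[l])))
        gammas[(k + 1, l + 1)] = d * (1.0 + d_kl) / norm
    return d, gammas


W = [(2.0, 0.0), (-1.0, 1.0), (-1.0, -1.0)]   # the weight vectors sum to zero
b = [0.0, 0.0, 0.0]
X = [(1.0, 0.0), (0.0, 2.0), (0.0, -1.0)]
y = [1, 2, 3]
d, gammas = geometrical_margins(W, b, X, y)
```

On this toy configuration, $d = 1$, the margin between categories 2 and 3 equals 1, and the margin between categories 1 and 3 equals $1/\sqrt{10}$.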

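The connection between the geometrical margins and the penalizer of $J_{\text{M-SVM}}$ is given
by the following equation:

$$\sum_{k<l} \|w_k - w_l\|^2 = Q \sum_{k=1}^Q \|w_k\|^2, \qquad (11)$$

the proof of which can for instance be found in Chapter 2 of [7]. We now introduce a result
needed in the proof of the master theorem of this report.

Proposition 3 For the hard margin M-SVM of Lee, Lin and Wahba, we have:

$$\frac{d_{\text{LLW}}^2}{Q} \sum_{k<l} \left( \frac{1 + d_{\text{LLW},kl}}{\gamma_{kl}} \right)^2 = \sum_{k=1}^Q \|w_k^0\|^2 = \alpha^{0T} H \alpha^0 = \frac{1}{Q-1} 1_{Qm}^T \alpha^0.$$

Proof

• $\frac{d_{\text{LLW}}^2}{Q} \sum_{k<l} \left( \frac{1 + d_{\text{LLW},kl}}{\gamma_{kl}} \right)^2 = \frac{Q}{(Q-1)^2} \sum_{k<l} \left( \frac{1 + d_{\text{LLW},kl}}{\gamma_{kl}} \right)^2 = \sum_{k=1}^Q \|w_k^0\|^2$

This equation is a direct consequence of Definition 4 and Equation (11).

• $\sum_{k=1}^Q \|w_k^0\|^2 = \alpha^{0T} H \alpha^0$

This is a direct consequence of Equation (10) and the definition of matrix $H$.

• $\alpha^{0T} H \alpha^0 = \frac{1}{Q-1} 1_{Qm}^T \alpha^0$

One of the Kuhn-Tucker optimality conditions is:

$$\alpha^0_{ik} \left( \langle w_k^0, \Phi(x_i) \rangle + b_k^0 + \frac{1}{Q-1} \right) = 0, \quad (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q),$$

and thus:

$$\sum_{i=1}^m \sum_{k=1}^Q \alpha^0_{ik} \left( \langle w_k^0, \Phi(x_i) \rangle + b_k^0 + \frac{1}{Q-1} \right) = 0.$$

By application of (7), this simplifies into

$$\sum_{i=1}^m \sum_{k=1}^Q \alpha^0_{ik} \langle w_k^0, \Phi(x_i) \rangle + \frac{1}{Q-1} 1_{Qm}^T \alpha^0 = 0.$$

Since

$$\sum_{i=1}^m \sum_{k=1}^Q \alpha^0_{ik} \langle w_k^0, \Phi(x_i) \rangle = -\alpha^{0T} H \alpha^0$$

is a direct consequence of (10), this concludes the proof.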
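
Equation (11), used in the proof above, holds for any family of vectors summing to the null vector. A quick numerical check on random vectors (centered so that the constraint is satisfied) can be sketched as follows; the function name is hypothetical and the check is not a substitute for the proof cited in the text.

```python
import random


def check_margin_identity(q, dim, seed=0):
    """Return both sides of sum_{k<l} ||w_k - w_l||^2 = Q * sum_k ||w_k||^2."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(q)]
    mean = [sum(w[t] for w in W) / q for t in range(dim)]
    W = [[w[t] - mean[t] for t in range(dim)] for w in W]  # enforce sum_k w_k = 0
    lhs = sum(sum((a - c) ** 2 for a, c in zip(W[k], W[l]))
              for k in range(q) for l in range(k + 1, q))
    rhs = q * sum(sum(a * a for a in w) for w in W)
    return lhs, rhs


lhs, rhs = check_margin_identity(q=4, dim=3)
```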
3 The M-SVM$^2$

3.1 Quadratic loss multi-class SVMs: motivation and principle

The M-SVMs presented in Section 2.2 share a common feature with the standard pattern
recognition SVM: the contribution of the slack variables to their objective functions is linear.
Let $\xi$ be the vector of these variables. In the cases of the M-SVMs of Weston and Watkins
and Lee, Lin and Wahba, we have $\xi = (\xi_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$ with $(\xi_{i y_i})_{1 \leq i \leq m} = 0_m$, and in the
case of the model of Crammer and Singer, it is simply $\xi = (\xi_i)_{1 \leq i \leq m}$. In both cases, the
contribution to the objective function is $C \|\xi\|_1$.

In the bi-class case, there exists a variant of the standard SVM which is known as the
2-norm SVM since for this machine, the empirical contribution to the objective function
is $C \|\xi\|_2^2$. Its main advantage, underlined for instance in Chapter 7 of [13], is that its
training algorithm can be expressed, after an appropriate change of kernel, as the training
algorithm of a hard margin machine. As a consequence, its leave-one-out error can be upper
bounded thanks to the radius-margin bound.

Unfortunately, a naive extension of the 2-norm SVM to the multi-class case, resulting
from substituting in the objective function of any of the three M-SVMs $\|\xi\|_1$ with $\|\xi\|_2^2$,
does not preserve this property. Section 2.4.1.4 of [7] gives detailed explanations about that
point. The strategy that we propose to exhibit interesting multi-class generalizations of
the 2-norm SVM consists in studying the class of quadratic loss M-SVMs, i.e., the class of
extensions of the M-SVMs such that the contribution of the slack variables is a quadratic
form:

$$C \xi^T M \xi = C \sum_{i=1}^m \sum_{j=1}^m \sum_{k=1}^Q \sum_{l=1}^Q m_{ik,jl} \, \xi_{ik} \xi_{jl}$$

where $M = (m_{ik,jl})_{1 \leq i,j \leq m, \; 1 \leq k,l \leq Q}$ is a symmetric positive semidefinite matrix.

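3.2 The M-SVM$^2$ as a multi-class generalization of the 2-norm SVM

In this section, we establish that the idea introduced above provides us with a solution to
the problem of interest when the M-SVM used is the one of Lee, Lin and Wahba and the
general term of the matrix $M$ is $m_{ik,jl} = \left( \delta_{k,l} - \frac{1}{Q} \right) \delta_{i,j}$. The corresponding machine, named
M-SVM$^2$, generalizes the 2-norm SVM to an arbitrary (but finite) number of categories.

Problem 4 (M-SVM$^2$)

$$\min_{w, b} J_{\text{M-SVM}^2}(w, b)$$

$$\text{s.t.} \begin{cases} \langle w_k, \Phi(x_i) \rangle + b_k \leq -\frac{1}{Q-1} + \xi_{ik}, & (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q) \\ \sum_{k=1}^Q w_k = 0 \\ \sum_{k=1}^Q b_k = 0 \end{cases}$$

where

$$J_{\text{M-SVM}^2}(w, b) = \frac{1}{2} \sum_{k=1}^Q \|w_k\|^2 + C \sum_{i=1}^m \sum_{j=1}^m \sum_{k=1}^Q \sum_{l=1}^Q \left( \delta_{k,l} - \frac{1}{Q} \right) \delta_{i,j} \, \xi_{ik} \xi_{jl}.$$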



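Note that as in the bi-class case, it is useless to introduce nonnegativity constraints for the
slack variables. The Lagrangian function associated with Problem 4 is thus

$$L(w, b, \xi, \alpha, \beta, \delta) = \frac{1}{2} \sum_{k=1}^Q \|w_k\|^2 + C \xi^T M \xi - \left\langle \delta, \sum_{k=1}^Q w_k \right\rangle - \beta \sum_{k=1}^Q b_k + \sum_{i=1}^m \sum_{k=1}^Q \alpha_{ik} \left( \langle w_k, \Phi(x_i) \rangle + b_k + \frac{1}{Q-1} - \xi_{ik} \right). \qquad (12)$$

Setting the gradient of $L$ with respect to $\xi$ equal to the null vector gives

$$2 C M \xi^* = \alpha^* \qquad (13)$$

which has for immediate consequence that

$$C \xi^{*T} M \xi^* - \alpha^{*T} \xi^* = -C \xi^{*T} M \xi^*. \qquad (14)$$

Using the same reasoning that we used to derive the objective function of Problem 3 and
(14), at the optimum, (12) simplifies into:

$$L(\xi^*, \alpha^*) = -\frac{1}{2} \alpha^{*T} H \alpha^* - C \xi^{*T} M \xi^* + \frac{1}{Q-1} 1_{Qm}^T \alpha^*. \qquad (15)$$

Besides, using (13),

$$\alpha_{in}^* \alpha_{ip}^* = 4 C^2 \sum_{k=1}^Q \left( \delta_{k,n} - \frac{1}{Q} \right) \xi_{ik}^* \sum_{l=1}^Q \left( \delta_{l,p} - \frac{1}{Q} \right) \xi_{il}^*$$

and thus

$$\alpha_{in}^* \alpha_{ip}^* = 4 C^2 \sum_{k=1}^Q \sum_{l=1}^Q \xi_{ik}^* \xi_{il}^* \left( \delta_{k,n} \delta_{l,p} - \frac{\delta_{k,n} + \delta_{l,p}}{Q} + \frac{1}{Q^2} \right).$$

By a double summation over $n$ and $p$, we have:

$$\sum_{n=1}^Q \sum_{p=1}^Q \alpha_{in}^* \alpha_{ip}^* \left( \delta_{n,p} - \frac{1}{Q} \right) = 4 C^2 \sum_{k=1}^Q \sum_{l=1}^Q \xi_{ik}^* \xi_{il}^* \sum_{n=1}^Q \sum_{p=1}^Q \left( \delta_{k,n} \delta_{l,p} - \frac{\delta_{k,n} + \delta_{l,p}}{Q} + \frac{1}{Q^2} \right) \left( \delta_{n,p} - \frac{1}{Q} \right).$$

Since

$$\sum_{n=1}^Q \sum_{p=1}^Q \left( \delta_{k,n} \delta_{l,p} - \frac{\delta_{k,n} + \delta_{l,p}}{Q} + \frac{1}{Q^2} \right) \left( \delta_{n,p} - \frac{1}{Q} \right) = \delta_{k,l} - \frac{1}{Q},$$

this simplifies into

$$\sum_{n=1}^Q \sum_{p=1}^Q \alpha_{in}^* \alpha_{ip}^* \left( \delta_{n,p} - \frac{1}{Q} \right) = 4 C^2 \sum_{k=1}^Q \sum_{l=1}^Q \xi_{ik}^* \xi_{il}^* \left( \delta_{k,l} - \frac{1}{Q} \right).$$

Finally, a double summation over $i$ and $j$ implies that

$$\alpha^{*T} M \alpha^* = 4 C^2 \xi^{*T} M \xi^*.$$

A substitution into (15) provides us with:

$$L(\alpha^*) = -\frac{1}{2} \alpha^{*T} \left( H + \frac{1}{2C} M \right) \alpha^* + \frac{1}{Q-1} 1_{Qm}^T \alpha^*.$$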

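As in the case of the hard margin version of the M-SVM of Lee, Lin and Wahba, setting the
gradient of (12) with respect to $b$ equal to the null vector gives:

$$\sum_{i=1}^m \sum_{l=1}^Q \alpha_{il}^* \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \quad (1 \leq k \leq Q).$$

Putting things together, we obtain the following expression for the dual problem of
Problem 4.

Problem 5 (M-SVM$^2$, dual formulation)

$$\max_{\alpha} J_{\text{M-SVM}^2,d}(\alpha)$$

$$\text{s.t.} \begin{cases} \alpha_{ik} \geq 0, & (1 \leq i \leq m), \; (1 \leq k \neq y_i \leq Q) \\ \sum_{i=1}^m \sum_{l=1}^Q \alpha_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, & (1 \leq k \leq Q) \end{cases}$$

where

$$J_{\text{M-SVM}^2,d}(\alpha) = -\frac{1}{2} \alpha^T \left( H + \frac{1}{2C} M \right) \alpha + \frac{1}{Q-1} 1_{Qm}^T \alpha.$$

Due to the definitions of the matrices $H$ and $M$, this is precisely Problem 3 with the
kernel $\kappa$ replaced by a kernel $\kappa'$ such that:

$$\kappa'(x_i, x_j) = \kappa(x_i, x_j) + \frac{1}{2C} \delta_{i,j}, \quad (1 \leq i, j \leq m).$$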
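
The change of kernel can be verified on a small example: building the Hessian from the modified Gram matrix should give the same matrix as adding $\frac{1}{2C} M$ to $H$. The sketch below performs this comparison entry by entry; the Gram matrix, the value of C and the function names are arbitrary choices made for the illustration.

```python
def delta(a, b):
    """Kronecker symbol."""
    return 1.0 if a == b else 0.0


def block_matrix(base, q):
    """Matrix of general term (delta_{k,l} - 1/Q) * base[i][j], (i,k) stored at i*Q + k."""
    m = len(base)
    return [[(delta(k, l) - 1.0 / q) * base[i][j]
             for j in range(m) for l in range(q)]
            for i in range(m) for k in range(q)]


def max_gap_after_kernel_change(gram, q, c):
    """Largest absolute difference between H + M/(2C) and the Hessian built from kappa'."""
    m = len(gram)
    identity = [[delta(i, j) for j in range(m)] for i in range(m)]
    h = block_matrix(gram, q)          # H, built from kappa
    big_m = block_matrix(identity, q)  # M, of general term (delta_kl - 1/Q) * delta_ij
    gram_prime = [[gram[i][j] + delta(i, j) / (2.0 * c) for j in range(m)]
                  for i in range(m)]
    h_prime = block_matrix(gram_prime, q)  # Hessian built from kappa'
    n = m * q
    return max(abs(h[a][b] + big_m[a][b] / (2.0 * c) - h_prime[a][b])
               for a in range(n) for b in range(n))


gram = [[1.0, 0.3, 0.1], [0.3, 1.0, 0.5], [0.1, 0.5, 1.0]]
gap = max_gap_after_kernel_change(gram, q=3, c=10.0)  # zero up to rounding
```

In practice, this means that a solver written for the hard margin machine can be reused for the quadratic loss machine by adding $\frac{1}{2C}$ to the diagonal of the Gram matrix.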

When $Q = 2$, the M-SVM of Lee, Lin and Wahba, like the two other ones, is equivalent
to the standard bi-class SVM (see for instance U). Furthermore, in that case, we get
$\xi^T M \xi = \frac{1}{2} \|\xi\|_2^2$. The M-SVM$^2$ is thus equivalent to the 2-norm SVM.
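
The value taken by the quadratic form when $Q = 2$ can be checked numerically. The sketch below uses the closed form $\sum_i \left( \sum_k \xi_{ik}^2 - \frac{1}{Q} \left( \sum_k \xi_{ik} \right)^2 \right)$ of $\xi^T M \xi$ and assumes, as for the dummy variables introduced earlier, that the slack variable associated with the true category of each example is zero. The function name is hypothetical.

```python
def quadratic_slack_penalty(xi_rows):
    """xi^T M xi with m_{ik,jl} = (delta_{k,l} - 1/Q) * delta_{i,j}.

    xi_rows[i][k] holds xi_ik; each row contributes sum_k xi_ik^2 - (sum_k xi_ik)^2 / Q.
    """
    total = 0.0
    for row in xi_rows:
        q = len(row)
        total += sum(v * v for v in row) - sum(row) ** 2 / q
    return total


# Q = 2: one slack variable per example, the other component (true category) is 0.
xi_rows = [[0.0, 0.5], [2.0, 0.0], [0.0, 3.0]]
penalty = quadratic_slack_penalty(xi_rows)
half_squared_norm = 0.5 * sum(v * v for row in xi_rows for v in row)
```

Both quantities equal 6.625 on this example. Note that a row with equal components, such as `[1.0, 1.0]`, contributes nothing, which reflects the fact that $M$ is only positive semidefinite.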




4 Multi-Class Radius-Margin Bound on the Leave-One-Out Error of the M-SVM$^2$

To begin with, we must recall Vapnik's initial bi-class theorem (see Chapter 10 of [15]),
which is based on an intermediate result of central importance known as the "key lemma".

4.1 Bi-class radius-margin bound 

Lemma 1 (Bi-class key lemma) Let us consider a hard margin bi-class SVM on a
domain $\mathcal{X}$. Suppose that it is trained on a set $d_m = \{(x_i, y_i) : 1 \leq i \leq m\}$ of $m$ couples of
$\mathcal{X} \times \{-1, 1\}$ (the points of which it separates without error). Consider now the same
machine, trained on $d_m \setminus \{(x_p, y_p)\}$. If it makes an error on $(x_p, y_p)$, then the inequality

$$\alpha_p^0 \geq \frac{1}{\mathcal{D}_m^2}$$

holds, where $\mathcal{D}_m$ is the diameter of the smallest sphere containing the images by the feature
map of the support vectors of the initial machine.

Theorem 1 (Bi-class radius-margin bound) Let $\gamma$ be the geometrical margin of the
hard margin SVM defined in Lemma 1, when trained on $d_m$. Let also $\mathcal{L}_m$ be the number
of errors resulting from applying a leave-one-out cross-validation procedure to this machine.
We have:

$$\mathcal{L}_m \leq \frac{\mathcal{D}_m^2}{\gamma^2}.$$
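
As a rough illustration of how such a bound is used for model selection, the sketch below evaluates the ratio of the squared diameter to the squared margin for a few candidate hyperparameter settings and keeps the smallest value. The candidate names and the (diameter, margin) pairs are made up for the example; in practice they would be obtained by training the machine for each setting.

```python
def biclass_radius_margin_bound(diameter: float, margin: float) -> float:
    """Upper bound D_m^2 / gamma^2 on the number of leave-one-out errors."""
    return diameter ** 2 / margin ** 2


def select_by_bound(candidates):
    """candidates: dict name -> (diameter, margin). Return (best name, its bound)."""
    bounds = {name: biclass_radius_margin_bound(d, g)
              for name, (d, g) in candidates.items()}
    best = min(bounds, key=bounds.get)
    return best, bounds[best]


# Hypothetical measurements for three kernel settings.
candidates = {
    "sigma=0.5": (1.4, 0.20),
    "sigma=1.0": (1.2, 0.30),
    "sigma=2.0": (0.8, 0.10),
}
best_name, best_bound = select_by_bound(candidates)
```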

The multi-class radius-margin bound that we propose in this report is a direct general- 
ization of the one proposed by Vapnik. The first step of the proof consists in establishing a 
"multi-class key lemma". This is the subject of the following subsection. 

4.2 Multi-class key lemma 

Lemma 2 (Multi-class key lemma) Let us consider a $Q$-category hard margin M-SVM
of Lee, Lin and Wahba on a domain $\mathcal{X}$. Let $d_m = \{(x_i, y_i) : 1 \leq i \leq m\}$ be its training set.
Consider now the same machine trained on $d_m \setminus \{(x_p, y_p)\}$. If it makes an error on $(x_p, y_p)$,
then the inequality

$$\max_{k \in [\![1, Q]\!]} \alpha^0_{pk} \geq \frac{1}{Q(Q-1) \mathcal{D}_m^2}$$

holds, where $\mathcal{D}_m$ is the diameter of the smallest sphere of the feature space containing the
set $\{\Phi(x_i) : 1 \leq i \leq m\}$.

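Proof Let $(w^p, b^p)$ be the couple characterizing the optimal hyperplanes when the machine
is trained on $d_m \setminus \{(x_p, y_p)\}$. Let

$$\alpha^p = \left( \alpha^p_{11}, \ldots, \alpha^p_{(p-1)Q}, 0, \ldots, 0, \alpha^p_{(p+1)1}, \ldots, \alpha^p_{mQ} \right)^T$$

be the corresponding vector of dual variables. $\alpha^p$ belongs to $\mathbb{R}_+^{Qm}$, with $(\alpha^p_{pk})_{1 \leq k \leq Q} = 0_Q$.
This representation is used to characterize directly the second M-SVM with respect to the
first one. Indeed, $\alpha^p$ is an optimal solution of Problem 3 under the additional constraint
$(\alpha_{pk})_{1 \leq k \leq Q} = 0_Q$. Let us define two more vectors in $\mathbb{R}_+^{Qm}$, $\lambda^p = (\lambda^p_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$ and
$\mu^p = (\mu^p_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$. $\lambda^p$ satisfies additional properties so that the vector $\alpha^0 - \lambda^p$ is a
feasible solution of Problem 3 under the additional constraint that $(\alpha^0_{pk} - \lambda^p_{pk})_{1 \leq k \leq Q} = 0_Q$,
i.e., $\alpha^0 - \lambda^p$ satisfies the same constraints as $\alpha^p$. We have

$$\forall i \neq p, \; \forall k \neq y_i, \quad \alpha^0_{ik} - \lambda^p_{ik} \geq 0 \iff \lambda^p_{ik} \leq \alpha^0_{ik}.$$

We deduce from the equality constraints of Problem 3 that:

$$\forall k \in [\![1, Q]\!], \quad \sum_{i=1}^m \sum_{l=1}^Q \left( \alpha^0_{il} - \lambda^p_{il} \right) \left( \frac{1}{Q} - \delta_{k,l} \right) = 0 \iff \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0.$$

To sum up, vector $\lambda^p$ satisfies the following constraints:

$$\begin{cases} \forall k, \quad \lambda^p_{pk} = \alpha^0_{pk} \\ \forall i \neq p, \; \forall k, \quad 0 \leq \lambda^p_{ik} \leq \alpha^0_{ik} \\ \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \quad (1 \leq k \leq Q) \end{cases} \qquad (16)$$

The properties of vector $\mu^p$ are such that $\alpha^p + K_1 \mu^p$ satisfies the constraints of the same
problem, where $K_1$ is a positive scalar the value of which will be specified in the sequel. We
have thus:

$$\forall i \neq p, \; \forall k \neq y_i, \quad \alpha^p_{ik} + K_1 \mu^p_{ik} \geq 0.$$

Moreover, we have

$$\forall k \neq y_p, \quad \alpha^p_{pk} + K_1 \mu^p_{pk} \geq 0 \iff \mu^p_{pk} \geq 0.$$

Finally,

$$\sum_{i=1}^m \sum_{l=1}^Q \left( \alpha^p_{il} + K_1 \mu^p_{il} \right) \left( \frac{1}{Q} - \delta_{k,l} \right) = 0 \iff \sum_{i=1}^m \sum_{l=1}^Q \mu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0.$$

To sum up, vector $\mu^p$ satisfies the following constraints:

$$\begin{cases} \forall i, \; \forall k \neq y_i, \quad \mu^p_{ik} \geq 0 \\ \sum_{i=1}^m \sum_{l=1}^Q \mu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) = 0, \quad (1 \leq k \leq Q) \end{cases} \qquad (17)$$

In the sequel, for the sake of simplicity, we write $J$ in place of $J_{\text{LLW},d}$. By construction of
vectors $\lambda^p$ and $\mu^p$, we have $J(\alpha^0 - \lambda^p) \leq J(\alpha^p)$ and $J(\alpha^p + K_1 \mu^p) \leq J(\alpha^0)$, and by way of
consequence,

$$J(\alpha^0) - J(\alpha^0 - \lambda^p) \geq J(\alpha^0) - J(\alpha^p) \geq J(\alpha^p + K_1 \mu^p) - J(\alpha^p). \qquad (18)$$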
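The expression of the first term is

$$J(\alpha^0) - J(\alpha^0 - \lambda^p) = \frac{1}{2} \lambda^{pT} H \lambda^p + \left( -H \alpha^0 + \frac{1}{Q-1} 1_{Qm} \right)^T \lambda^p. \qquad (19)$$

Given (6) and the definition of matrix $H$,

$$\begin{aligned}
\left( -H \alpha^0 + \frac{1}{Q-1} 1_{Qm} \right)^T \lambda^p &= \sum_{i=1}^m \sum_{k \neq y_i} \left( \langle w_k^0, \Phi(x_i) \rangle + \frac{1}{Q-1} \right) \lambda^p_{ik} \\
&= \sum_{i=1}^m \sum_{k \neq y_i} \left( \langle w_k^0, \Phi(x_i) \rangle + b_k^0 + \frac{1}{Q-1} \right) \lambda^p_{ik} - \sum_{i=1}^m \sum_{k \neq y_i} b_k^0 \lambda^p_{ik}. \qquad (20)
\end{aligned}$$

Due to the constraints of correct classification and the nonnegativity of the components of
vector $\lambda^p$, the first double sum of the right-hand side of (20) is nonpositive. Furthermore,
making use of the equality constraints of (16) and $\sum_{k=1}^Q b_k^0 = 0$ gives:

$$\sum_{i=1}^m \sum_{k=1}^Q b_k^0 \lambda^p_{ik} = \sum_{k=1}^Q b_k^0 \sum_{i=1}^m \lambda^p_{ik} = \left( \sum_{k=1}^Q b_k^0 \right) \left( \frac{1}{Q} \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \right) = 0.$$

Thus,

$$\left( -H \alpha^0 + \frac{1}{Q-1} 1_{Qm} \right)^T \lambda^p \leq 0.$$

A substitution into (19) provides us with the following upper bound on $J(\alpha^0) - J(\alpha^0 - \lambda^p)$:

$$J(\alpha^0) - J(\alpha^0 - \lambda^p) \leq \frac{1}{2} \lambda^{pT} H \lambda^p,$$

and equivalently, by definition of $H$,

$$J(\alpha^0) - J(\alpha^0 - \lambda^p) \leq \frac{1}{2} \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2. \qquad (21)$$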



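We now turn to the right-hand side of (18). The line of reasoning already used for the
left-hand side gives:

$$J(\alpha^p + K_1 \mu^p) - J(\alpha^p) = K_1 \left( -H \alpha^p + \frac{1}{Q-1} 1_{Qm} \right)^T \mu^p - \frac{K_1^2}{2} \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \mu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2 \qquad (22)$$

with

$$\left( -H \alpha^p + \frac{1}{Q-1} 1_{Qm} \right)^T \mu^p = \sum_{i=1}^m \sum_{k \neq y_i} \left( h_k^p(x_i) + \frac{1}{Q-1} \right) \mu^p_{ik}, \qquad (23)$$

where $h_k^p = \langle w_k^p, \cdot \rangle + b_k^p$.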
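By hypothesis, the M-SVM trained on $d_m \setminus \{(x_p, y_p)\}$ does not classify $x_p$ correctly. This
means that there exists $n \in [\![1, Q]\!] \setminus \{y_p\}$ such that $h_n^p(x_p) \geq 0$. Let $I$ be a mapping from
$[\![1, Q]\!] \setminus \{n\}$ to $[\![1, m]\!] \setminus \{p\}$ such that

$$\forall k \in [\![1, Q]\!] \setminus \{n\}, \quad \alpha^p_{I(k)k} > 0.$$

We know that such a mapping exists, otherwise, given the equality constraints of Problem 3,
vector $\alpha^p$ would be equal to the null vector. For $K_2 \in \mathbb{R}_+^*$, let $\mu^p$ be the vector of $\mathbb{R}^{Qm}$ that
only differs from the null vector in the following way:

$$\begin{cases} \mu^p_{pn} = K_2 \\ \forall k \in [\![1, Q]\!] \setminus \{n\}, \quad \mu^p_{I(k)k} = K_2 \end{cases}$$

Obviously, this solution is feasible (satisfies the constraints (17)). Indeed, $\frac{1}{Q} \sum_{i=1}^m \sum_{k=1}^Q \mu^p_{ik} = K_2$
and $\sum_{i=1}^m \mu^p_{ik} = K_2$, $(1 \leq k \leq Q)$. With this definition of vector $\mu^p$, the right-hand side
of (23) simplifies into:

$$K_2 \left( h_n^p(x_p) + \sum_{k \neq n} h_k^p(x_{I(k)}) + \frac{Q}{Q-1} \right).$$

Vector $\mu^p$ has been specified so as to make it possible to exhibit a nontrivial lower bound
on this last expression. By definition of $n$, $h_n^p(x_p) \geq 0$. Furthermore, the Kuhn-Tucker
optimality conditions:

$$\alpha^p_{ik} \left( \langle w_k^p, \Phi(x_i) \rangle + b_k^p + \frac{1}{Q-1} \right) = 0, \quad (1 \leq i \neq p \leq m), \; (1 \leq k \neq y_i \leq Q)$$

imply that $\left( h_k^p(x_{I(k)}) \right)_{1 \leq k \neq n \leq Q} = -\frac{1}{Q-1} 1_{Q-1}$. As a consequence, a lower bound on the
right-hand side of (23) is provided by:

$$\sum_{i=1}^m \sum_{k \neq y_i} \left( h_k^p(x_i) + \frac{1}{Q-1} \right) \mu^p_{ik} \geq \frac{K_2}{Q-1}.$$

It springs from this bound and (22) that

$$J(\alpha^p + K_1 \mu^p) - J(\alpha^p) \geq \frac{K_1 K_2}{Q-1} - \frac{K_1^2}{2} \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \mu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2. \qquad (24)$$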



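Combining (18), (21) and (24) finally gives:

$$\frac{1}{2} \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2 \geq \frac{K_1 K_2}{Q-1} - \frac{K_1^2}{2} \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \mu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2. \qquad (25)$$

Let $\nu^p = (\nu^p_{ik})_{1 \leq i \leq m, 1 \leq k \leq Q}$ be the vector of $\mathbb{R}_+^{Qm}$ such that $\mu^p = K_2 \nu^p$. The value of the
scalar $K_3 = K_1 K_2$ maximizing the right-hand side of (25) is:

$$K_3^* = \frac{1}{(Q-1) \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \nu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2}.$$

By substitution in (25), this means that:

$$(Q-1)^2 \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2 \sum_{k=1}^Q \left\| \sum_{i=1}^m \sum_{l=1}^Q \nu^p_{il} \left( \frac{1}{Q} - \delta_{k,l} \right) \Phi(x_i) \right\|^2 \geq 1.$$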



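For $\eta$ in $\mathbb{R}^{Qm}$, let $K(\eta) = \frac{1}{Q} \sum_{i=1}^m \sum_{k=1}^Q \eta_{ik}$. We have, for all $k$ in $[\![1, Q]\!]$:

$$\left\| \frac{1}{Q} \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il} \Phi(x_i) - \sum_{i=1}^m \lambda^p_{ik} \Phi(x_i) \right\|^2 = K(\lambda^p)^2 \left\| \text{conv}_1(\Phi(x_i)) - \text{conv}_2(\Phi(x_i)) \right\|^2$$

where $\text{conv}_1(\Phi(x_i))$ and $\text{conv}_2(\Phi(x_i))$ are two convex combinations of the $\Phi(x_i)$. As a
consequence, $\left\| \text{conv}_1(\Phi(x_i)) - \text{conv}_2(\Phi(x_i)) \right\|$ can be bounded from above by $\mathcal{D}_m$. Since the
same reasoning applies to $\nu^p$, we get:

$$(Q-1)^2 Q^2 K(\lambda^p)^2 K(\nu^p)^2 \mathcal{D}_m^4 \geq 1. \qquad (26)$$

By construction, $K(\nu^p) = 1$. We now construct a vector $\lambda^p$ minimizing the objective
function $K$. First, note that due to the equality constraints satisfied by this vector,

$$\forall k \in [\![1, Q]\!], \quad \sum_{i=1}^m \lambda^p_{ik} = \frac{1}{Q} \sum_{i=1}^m \sum_{l=1}^Q \lambda^p_{il}.$$

As a consequence,

$$\forall (k, l) \in [\![1, Q]\!]^2, \quad \sum_{i=1}^m \lambda^p_{ik} = \sum_{i=1}^m \lambda^p_{il}.$$

This implies that:

$$\forall k \in [\![1, Q]\!], \quad \sum_{i=1}^m \lambda^p_{ik} \geq \max_{l \in [\![1, Q]\!]} \alpha^0_{pl}.$$

Obviously, both the box constraints in (16) and the nature of $K$ call for the choice of small
values for the components $\lambda^p_{ik}$. Thus, there is a feasible solution $\lambda^{p*}$ such that:

$$\forall k \in [\![1, Q]\!], \quad \sum_{i=1}^m \lambda^{p*}_{ik} = \max_{l \in [\![1, Q]\!]} \alpha^0_{pl}.$$

This solution is such that $K(\lambda^{p*}) = \max_{k \in [\![1, Q]\!]} \alpha^0_{pk}$. The substitution of the values of $K(\nu^p)$
and $K(\lambda^{p*})$ in (26) provides us with:

$$\left( \max_{k \in [\![1, Q]\!]} \alpha^0_{pk} \right)^2 \geq \frac{1}{(Q-1)^2 Q^2 \mathcal{D}_m^4}.$$

Taking the square root of both sides concludes the proof of the lemma. ■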

4.3 Multi-class radius-margin bound 

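Theorem 2 (Multi-class radius-margin bound) Let us consider a $Q$-category hard
margin M-SVM of Lee, Lin and Wahba on a domain $\mathcal{X}$. Let $d_m = \{(x_i, y_i) : 1 \leq i \leq m\}$ be its
training set, $\mathcal{L}_m$ the number of errors resulting from applying a leave-one-out cross-validation
procedure to this machine, and $\mathcal{D}_m$ the diameter of the smallest sphere of the feature space
containing the set $\{\Phi(x_i) : 1 \leq i \leq m\}$. Then the following upper bound holds true:

$$\mathcal{L}_m \leq Q^2 \mathcal{D}_m^2 \sum_{k<l} \left( \frac{1 + d_{\text{LLW},kl}}{\gamma_{kl}} \right)^2.$$

Proof Lemma 2 exhibits a nontrivial lower bound on $\max_{k \in [\![1, Q]\!]} \alpha^0_{pk}$ when the machine
trained on the set $d_m \setminus \{(x_p, y_p)\}$ makes an error on $(x_p, y_p)$, i.e., when $(x_p, y_p)$ contributes
to $\mathcal{L}_m$. As a consequence,

$$1_{Qm}^T \alpha^0 \geq \sum_{i=1}^m \max_{k \in [\![1, Q]\!]} \alpha^0_{ik} \geq \frac{\mathcal{L}_m}{Q(Q-1) \mathcal{D}_m^2}.$$

According to Proposition 3, $1_{Qm}^T \alpha^0 = \frac{Q}{Q-1} \sum_{k<l} \left( \frac{1 + d_{\text{LLW},kl}}{\gamma_{kl}} \right)^2$. A substitution in the
inequality above thus provides us with the result announced. ■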
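
Once the geometrical margins and the parameters $d_{\text{LLW},kl}$ of a trained machine are known, the right-hand side of the bound of Theorem 2 is straightforward to evaluate. The sketch below takes these quantities, together with the diameter, as inputs; how the diameter is computed is left aside, and the function name and the numerical values are hypothetical.

```python
def multiclass_radius_margin_bound(q, diameter, gammas, d_kl):
    """Q^2 * D_m^2 * sum_{k<l} ((1 + d_kl) / gamma_kl)^2.

    gammas and d_kl are dicts indexed by pairs (k, l) with k < l.
    """
    total = sum(((1.0 + d_kl[pair]) / gammas[pair]) ** 2 for pair in gammas)
    return q ** 2 * diameter ** 2 * total


pairs = [(1, 2), (1, 3), (2, 3)]
gammas = {pair: 2.0 for pair in pairs}   # hypothetical margins
d_kl = {pair: 0.0 for pair in pairs}     # hypothetical values of d_{LLW,kl}
bound = multiclass_radius_margin_bound(q=3, diameter=1.0, gammas=gammas, d_kl=d_kl)
```

For these inputs the bound equals 6.75; it is only informative when it is smaller than the number of training examples.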
5 Conclusions and Future Work

In this report, we have introduced a variant of the M-SVM of Lee, Lin and Wahba that
strictly generalizes the 2-norm SVM to the multi-class case. For this quadratic loss M-SVM,
named M-SVM$^2$, we have then established a generalization of Vapnik's radius-margin bound.
We conjecture that this bound could be improved by a $Q^2$ factor. As it is, it can already
be compared with those proposed in [18] for model selection. This, together with a general study of
the quadratic loss M-SVMs, is the subject of ongoing research.